Learning to Perceive Non-Native Tones via Distributional Training: Effects of Task and Acoustic Cue Weighting

As many distributional learning (DL) studies have shown, adult listeners can achieve discrimination of a difficult non-native contrast after a short repetitive exposure to tokens falling at the extremes of that contrast. Such studies have shown using behavioural methods that a short distributional training can induce perceptual learning of vowel and consonant contrasts. However, much less is known about the neurological correlates of DL, and few studies have examined non-native lexical tone contrasts. Here, Australian-English speakers underwent DL training on a Mandarin tone contrast using behavioural (discrimination, identification) and neural (oddball-EEG) tasks, with listeners hearing either a bimodal or a unimodal distribution. Behavioural results show that listeners learned to discriminate tones after both unimodal and bimodal training; while EEG responses revealed more learning for listeners exposed to the bimodal distribution. Thus, perceptual learning through exposure to brief sound distributions (a) extends to non-native tonal contrasts, and (b) is sensitive to task, phonetic distance, and acoustic cue-weighting. Our findings have implications for models of how auditory and phonetic constraints influence speech learning.


Introduction
Listening to speech involves identifying linguistic structures in a fast, continuous utterance stream and processing their relative ordering to retrieve an intended meaning. In the native language, multiple sources of relevant information are drawn upon to accomplish this task, including knowledge of the vocabulary and the relative likelihood of different sequential patterns at lower and higher levels of linguistic structure, as well as the rules governing sequential processes at the phonetic level. Non-native languages can differ from a listener's native languages on all these dimensions. We here address the question of whether listeners faced with the task of discriminating a novel non-native contrast show preferences as to which dimension of linguistic information they attend to (or attend to most). To better understand the source of any such preference, we compare behavioural against neurophysiological measures.
There is a considerable literature concerning the perception and learning of non-native contrasts, and it naturally shares ground with the very extensive literature on native For infants, bimodal distributional training enhances Mandarin tone perception for Dutch 11-month-olds, but not for 5-or 14-month-olds [28]. For adults, Ong and colleagues [12] found that Australian-English speakers exposed to a bimodal distribution of a Thai tone contrast showed no automatic learning effect; but when the training involved active listening (through a request for acknowledging the heard stimuli), their sensitivity to tone improved. Interestingly, exposure to a bimodal distribution of Thai tones resulted in enhanced perception of both linguistic and musical tones for Australian-English speakers, demonstrating cross-domain transfer [4,29]. Note that linguistic and musical tone perception also tend to be correlated in non-tone language adults' performance [26,30] and in psycho-acoustic perception [31].
All the above results stem from behavioural research because statistical learning research so far has made little use of neurophysiological methods, despite evidence that neural responses can provide processing information at both pre-attentive and cortical levels [32,33]. Compared to behavioural measures, however, neural techniques such as electroencephalography (EEG) have the benefit of requiring no overt attention or decision processes. They are typically sensitive to early pre-attentive responses, reflecting the neural basis of acoustic-phonetic processing [34,35] and providing another measure of implicit discrimination.
Importantly, evidence from non-native speech perception studies shows that nonnative listeners exhibit mismatch negativity (MMN) responses for contrasts they do not discriminate in behavioural tasks [36,37]. The MMN response has been used extensively in speech perception studies to examine how the perceptual system indexes incoming acoustic stimuli pre-attentively, i.e., in a manner not requiring overt attention. Based on the formation of memory traces, the MMN response is a negative-going waveform that is typically observed in the frontal electrodes. It signals detection of change in a stream of auditory stimuli, and specifically indicates differences between stimuli with different acoustics. It is obtained by computing the difference in the event-related potential (ERP) response to an infrequent (termed deviant) stimulus versus a frequent (termed standard) one. This response typically occurs 150 to 250 ms post onset of the stimulus switch [38,39].
Despite the scarcity of neurally-based research on distributional learning, some evidence is available on listeners' perceptual flexibility for the tonal dimension of speech. In tone perception, pitch height and pitch direction are the two main acoustic cues [40]. Nixon and colleagues [41] explored German listeners' neural discrimination of a Cantonese high-mid pitch height contrast, by exposing them to a bimodal distribution of a 13-step tone continuum. Although a prediction of enhanced MMN responses at the two Gaussian peaks was not supported, the listeners' perception of pitch height improved across all steps along the continuum. A follow-up study of listeners' neural sensitivity to cross-boundary differences again showed enhanced sensitivity to overall pitch differences over the course of the tone exposure [42]. In other words, acuity with respect to acoustic pitch differences was increased during distributional learning not only between but also within categories. In contrast to the findings of previous studies testing segmental features, these results indicate that exposure to a bimodal distribution may not necessarily lead to the enhanced discrimination of specific steps or categories along the pitch continuum, but may rather alter listeners' overall sensitivity to tonal changes. A caveat in this interpretation is that these studies did not include a unimodal distribution. Taken together, when EEG studies are adopted to examine tone perception and learning, results often illustrate robust sensitivity not merely restricted to the patterns predicted by frequency distributions as shown in behavioural studies.
On the same note, a recent study examining 5-6-month-old infants' neural sensitivity to an 8-step contrast of flat versus falling pitch (a Mandarin tone contrast) found a surprising enhancement effect after exposure to a unimodal but not a bimodal distribution [43]. This finding was explained in relation to listeners' acoustic sensitivity to frequently heard tokens at peak locations along the tone continuum. The high-frequency tokens had smaller differences in the unimodal (steps 4-5) than in the bimodal (steps 2-7) distributions. The authors of [43] argued that frequent exposure to greater acoustic distance may lead to reduced neural sensitivity to a smaller acoustic distance (steps [3][4][5][6]. This interpretation highlights the role of the magnitude of the acoustic distinctions in the stimuli when prior training and exposure is insufficient to establish phonetic categories, which can be explained by models of non-native perception that focus on acoustic cue weighting and salience (see for instance, [44]).
In the present study, we examined tone perception and learning for non-tonal language speakers, collecting and directly comparing behavioural and neural responses. Australian English listeners with no prior knowledge of any tone language heard a Mandarin tone contrast. Tone perception ability has long been known to vary as a function of the acoustic properties of the tonal input [45]. We varied tone features and the nature of the input distribution (comparing uni-versus bimodality). Experiment 1 tested listeners' ability to discern the level (T1)-falling (T4) Mandarin tone contrast before and after distributional training, using discrimination and identification tasks, respectively. Experiment 2 then examined listeners' neural sensitivity to the same contrast, in a standard MMN paradigm, assessing amplitude, latency, anteriority, laterality and topographic differences between the two modalities. Previous work has shown differences in factors such as anteriority and laterality depending on stimulus characteristics and participant language background [46][47][48] while other work has not [33,49]. We included them here to ensure that we captured the most comprehensive set of results for examining potential effects of distributional learning at the neural level.

Participants
Forty-eight native Australian English speakers naïve to tone languages took part (36 females; M age = 23.06, SD age = 5.57). Nine participants reported having musical training (ranging from 1 to 10 years), though only one continued to practise music at the time of testing. All participants reported normal speech and hearing, provided written informed consent prior to testing, and received course credit or a small monetary compensation.

Stimuli
This study focused on the Mandarin Chinese level [T1] versus falling [T4] tonal contrast. A female Mandarin speaker produced natural tokens of /taT1/ ('take') and /taT4/ ('big') in a soundproof booth. Recording used the computer program Audacity and a Genelec 1029A microphone, with a 16-bit sampling depth and a sampling rate of 44.1 kHz. To create the 8-step continuum, equidistant stimulus steps differing in pitch contour were constructed from /taT1/ to /taT4/ ( Figure 1) using the following procedure in Praat [50]: First, four interpolation points along the pitch contours (at 0%, 33%, 67% and 100%) were marked (Supplementary Material A). Next, the distances (in Hz) between the corresponding points were divided into seven equal spaces, generating six new layers. New pitch tokens were then created by connecting the four corresponding intermediate points on the same layer. This formed a continuum of eight steps (including the endpoint contours) from /taT1/ (step 1) to /taT4/ (step 8). Stimulus intensity was set to 65 dB SPL and duration was set to approximately 400 ms. Five native Mandarin speakers listened to the stimuli and confirmed that they were acceptable tokens of these Mandarin syllables. The contrast (with the same stimuli) has been used in previous studies [51][52][53][54].

Procedure
The experiment consisted of three phases in the following order: pre-test, distribu tional learning and post-test. Task programming, presentation of stimuli and respons recording were conducted using E-Prime (Psychology Software Tools Inc., Sharpsburg PA, USA) on a Dell Latitude E5550 laptop. Auditory stimuli were presented at 65 dB SP via Sennheiser HD 280 Pro headphones and were ordered randomly. No corrective feed back was given. The experiment took approximately 40 min to complete.
In the distributional learning phase, participants were randomly assigned to one o the two conditions: unimodal or bimodal distribution ( Figure 2). The two conditions ha different distributional peaks (a single central category vs. two separate categories) bu were equal in terms of the total amount of distributional learning tonal input (256 tokens and duration (360 s). In the bimodal condition, stimuli from the peripheral positions o the continuum were presented with higher frequency. In other words, participants hear tokens of steps 2 and 7 most frequently. In the unimodal condition, stimuli near the centra positions, namely tokens of steps 4 and 5, were presented most frequently. Crucially, stim ulus steps 3 and 6 were presented an equal number of times in both conditions. At pre-and at post-test, participants were first presented with a discrimination tas in which they indicated via keypresses whether paired lexical tone stimuli steps were th same or different. Trials included same (e.g., steps 1-1, 2-2) and different pairs (e.g., step 2-4, 3-6) along the continuum, with each pair presented 10 times. Trials not involving th target contrast functioned as controls. The inter-stimulus interval between tokens wa 1000 ms.

Procedure
The experiment consisted of three phases in the following order: pre-test, distributional learning and post-test. Task programming, presentation of stimuli and response recording were conducted using E-Prime (Psychology Software Tools Inc., Sharpsburg, PA, USA) on a Dell Latitude E5550 laptop. Auditory stimuli were presented at 65 dB SPL via Sennheiser HD 280 Pro headphones and were ordered randomly. No corrective feedback was given. The experiment took approximately 40 min to complete.
In the distributional learning phase, participants were randomly assigned to one of the two conditions: unimodal or bimodal distribution ( Figure 2). The two conditions had different distributional peaks (a single central category vs. two separate categories) but were equal in terms of the total amount of distributional learning tonal input (256 tokens) and duration (360 s). In the bimodal condition, stimuli from the peripheral positions of the continuum were presented with higher frequency. In other words, participants heard tokens of steps 2 and 7 most frequently. In the unimodal condition, stimuli near the central positions, namely tokens of steps 4 and 5, were presented most frequently. Crucially, stimulus steps 3 and 6 were presented an equal number of times in both conditions.

Procedure
The experiment consisted of three phases in the following order: pre-test, distribu tional learning and post-test. Task programming, presentation of stimuli and respons recording were conducted using E-Prime (Psychology Software Tools Inc., Sharpsburg PA, USA) on a Dell Latitude E5550 laptop. Auditory stimuli were presented at 65 dB SP via Sennheiser HD 280 Pro headphones and were ordered randomly. No corrective feed back was given. The experiment took approximately 40 min to complete.
In the distributional learning phase, participants were randomly assigned to one o the two conditions: unimodal or bimodal distribution ( Figure 2). The two conditions ha different distributional peaks (a single central category vs. two separate categories) bu were equal in terms of the total amount of distributional learning tonal input (256 token and duration (360 s). In the bimodal condition, stimuli from the peripheral positions o the continuum were presented with higher frequency. In other words, participants hear tokens of steps 2 and 7 most frequently. In the unimodal condition, stimuli near the centra positions, namely tokens of steps 4 and 5, were presented most frequently. Crucially, stim ulus steps 3 and 6 were presented an equal number of times in both conditions. At pre-and at post-test, participants were first presented with a discrimination tas in which they indicated via keypresses whether paired lexical tone stimuli steps were th same or different. Trials included same (e.g., steps 1-1, 2-2) and different pairs (e.g., step 2-4, 3-6) along the continuum, with each pair presented 10 times. Trials not involving th target contrast functioned as controls. The inter-stimulus interval between tokens wa 1000 ms. At pre-and at post-test, participants were first presented with a discrimination task in which they indicated via keypresses whether paired lexical tone stimuli steps were the same or different. Trials included same (e.g., steps 1-1, 2-2) and different pairs (e.g., steps 2-4, 3-6) along the continuum, with each pair presented 10 times. Trials not involving the target contrast functioned as controls. The inter-stimulus interval between tokens was Participants then completed an identification task in which they indicated (again via keypresses) whether the tone of each continuum step (e.g., step 3) was a flat tone (indicated by a flat arrow) or a falling tone (indicated by a falling arrow), with each tone presented 6 times. For each task, four auditory examples were played as practice trials prior to testing. Trials were self-paced and presented in random order. Analyses for the discrimination task targeted the perception of steps 3 to 6, and those for the identification task focused on step 3 and step 6. As an additional control, the Pitch-Contour Perception Test (PCPT; [56,57]) was included in the post-training phase, to examine participants' pitch perceptual abilities (Supplementary Material B). This test required indication of whether isolated tone tokens had a flat, rising or falling contour. The PCPT allows allocation -of participants to high and low aptitude groups, to examine whether ability to perceive pitch affects identification and discrimination responses. No differences either in the identification or in the discrimination tasks were observed between listeners with high versus low aptitude in the present study.

Results
We first compared listeners' percentage of accurate choices for the target contrast (steps 3-6) in the discrimination task before and after distributional learning to chance (50%) with a one-sample t-test (Table 1). Neither condition showed discrimination above chance before distributional learning, while after training, listeners in both conditions were able to discriminate the contrast. We then conducted a Repeated Measures Analysis of Variance (RM ANOVA) with condition (2-level, unimodal vs. bimodal) as the betweensubjects factor and accuracy (2-level, pre-and post-training phases) as the within-subjects variable ( Figure 3). The main effect of training was significant (F(1, 46) = 4.731, p = 0.035, η g 2 = 0.093), indicating a difference in accuracy before and after distributional learning. The interaction between condition and test phase was not significant (F(1, 46) = 0.526, p = 0.472, η g 2 = 0.011), suggesting no difference between unimodal and bimodal exposure. In other words, listeners' tone discrimination improved after exposure to either distribution. Participants then completed an identification task in which they indicated (again via keypresses) whether the tone of each continuum step (e.g., step 3) was a flat tone (indi cated by a flat arrow) or a falling tone (indicated by a falling arrow), with each tone pre sented 6 times. For each task, four auditory examples were played as practice trials prio to testing. Trials were self-paced and presented in random order. Analyses for the discrimination task targeted the perception of steps 3 to 6, and thos for the identification task focused on step 3 and step 6. As an additional control, the Pitch Contour Perception Test (PCPT; [56,57]) was included in the post-training phase, to ex amine participants' pitch perceptual abilities (Supplementary Material B). This test re quired indication of whether isolated tone tokens had a flat, rising or falling contour. Th PCPT allows allocation -of participants to high and low aptitude groups, to examin whether ability to perceive pitch affects identification and discrimination responses. No differences either in the identification or in the discrimination tasks were observed be tween listeners with high versus low aptitude in the present study.

Results
We first compared listeners' percentage of accurate choices for the target contras (steps 3-6) in the discrimination task before and after distributional learning to chanc (50%) with a one-sample t-test (Table 1). Neither condition showed discrimination abov chance before distributional learning, while after training, listeners in both condition were able to discriminate the contrast. We then conducted a Repeated Measures Analysi of Variance (RM ANOVA) with condition (2-level, unimodal vs. bimodal) as the between subjects factor and accuracy (2-level, pre-and post-training phases) as the within-subject variable ( Figure 3). The main effect of training was significant (F(1, 46) = 4.731, p = 0.035 ηg 2 = 0.093), indicating a difference in accuracy before and after distributional learning The interaction between condition and test phase was not significant (F(1, 46) = 0.526, p = 0.472, ηg 2 = 0.011), suggesting no difference between unimodal and bimodal exposure. In other words, listeners' tone discrimination improved after exposure to either distribution Table 1. Mean (SD) accuracy percentage and corresponding t and p values in one-sample t-tes against the chance level in bimodal and unimodal before and after distributional learning in th discrimination task. (See Supplementary Material C for descriptive statistics of other contrasts.).  We then examined listeners' choices in the identification task before and after learn ing ( Table 2). Across participants, Step 6 was more often identified as falling than step 3 We then conducted a repeated measures ANOVA with condition (2-level, unimodal vs We then examined listeners' choices in the identification task before and after learning ( Table 2). Across participants, Step 6 was more often identified as falling than step 3.

Mean
We then conducted a repeated measures ANOVA with condition (2-level, unimodal vs. bimodal) as the between-subjects factor, and with percentage of falling choices across phase (2-level, pre-and post-training) and step (2-level, steps 3 & 6) as the within-subjects variables ( Figure 4). The only significant factor was step (F(1, 46) = 57.343, p < 0.001, η g 2 = 0.555). No other factors or interactions were significant (Fs < 1.936, ps > 0.170, η g 2 < 0.040). In contrast to the discrimination outcomes, no trace of improvement was observed in listeners' tone identification after either distributional condition. Step 3 Step bimodal) as the between-subjects factor, and with percentage of falling choices across phase (2-level, pre-and post-training) and step (2-level, steps 3 & 6) as the within-subjects variables ( Figure 4). The only significant factor was step (F(1, 46) = 57.343, p < 0.001, ηg 2 = 0.555). No other factors or interactions were significant (Fs < 1.936, ps > 0.170, ηg 2 < 0.040). In contrast to the discrimination outcomes, no trace of improvement was observed in listeners' tone identification after either distributional condition.

Discussion
Experiment 1 used behavioural measures to investigate how distributional learning of a Mandarin tone contrast affects listeners' tone discrimination and identification. Listeners showed improved discrimination abilities after exposure to either unimodal or bimodal distributions, and no change was observed in their identification patterns.
In the discrimination task, successful distributional learning would predict enhanced discrimination of steps 3-6 after the bimodal condition, and/or reduced discrimination of this step after the unimodal experience. However, our results showed that listeners' tonal sensitivity was enhanced after distributional learning irrespective of the embedded statistical information. Statistical exposure appears to benefit participants more in acoustic than in statistical cues. This interpretation resembles reported EEG studies with a Cantonese tone contrast [41,42], and agrees with findings showing that listeners attend more to prosodic than statistical cues to segment speech streams [20]. It is worth mentioning that only a single (bimodal) distribution was tested in these cited studies.
In identification, listeners' difficulty in anchoring non-native tones to given categories plausibly reflects their lack of tonal categories in the first place (N.B. performance differences in identification versus discrimination have been attested before.) Non-tone

Discussion
Experiment 1 used behavioural measures to investigate how distributional learning of a Mandarin tone contrast affects listeners' tone discrimination and identification. Listeners showed improved discrimination abilities after exposure to either unimodal or bimodal distributions, and no change was observed in their identification patterns.
In the discrimination task, successful distributional learning would predict enhanced discrimination of steps 3-6 after the bimodal condition, and/or reduced discrimination of this step after the unimodal experience. However, our results showed that listeners' tonal sensitivity was enhanced after distributional learning irrespective of the embedded statistical information. Statistical exposure appears to benefit participants more in acoustic than in statistical cues. This interpretation resembles reported EEG studies with a Cantonese tone contrast [41,42], and agrees with findings showing that listeners attend more to prosodic than statistical cues to segment speech streams [20]. It is worth mentioning that only a single (bimodal) distribution was tested in these cited studies.
In identification, listeners' difficulty in anchoring non-native tones to given categories plausibly reflects their lack of tonal categories in the first place (N.B. performance differences in identification versus discrimination have been attested before.) Non-tone language speakers often find it hard to make recognition responses to tones. Their much better tone discrimination ability, in contrast, reflects sensitivity to general pitch information (note that pitch perception in language and music correlate [26,30]).
To further examine the relationship between tone processing, distributional learning, and the cues listeners pay attention to in these processes, Experiment 2 examined listeners' neural changes induced by distributional learning.

Participants
A new sample of 32 Australian English speakers naïve to tone languages participated in Experiment 2 (22 females; M age = 22.9, SD age = 7.60). Eight participants reported having musical training (ranging from 1 to 7 years), though only one continued to practise music at time of test. Participants provided written informed consent prior to the experiment and received course credit or a small reimbursement for taking part.

Stimuli
The same stimuli were used as in Experiment 1, except that the duration of all stimulus tokens was reduced to 100 ms to accommodate to the EEG paradigm.

Procedure
As in Experiment 1, there were three phases: pre-test, distributional learning, and posttest. In the distributional learning phase, participants were randomly assigned to either the unimodal or the bimodal condition, with equal numbers of bilingual and monolingual participants in each condition. The two conditions did not differ in the total number of exposure trials (256 tokens) or duration (360 s) but varied in the frequency distribution (one vs. two Gaussian peaks) along the phonetic continuum only. Stimuli near the central positions were presented most frequently in the unimodal condition, whereas those from the peripheral sides of the continuum were presented with the highest frequency in the bimodal condition. Importantly, the frequency of occurrence of tokens from steps 3 and 6 was again identical across both conditions. An EEG passive oddball paradigm was used for pre-and post-test, in which two separate blocks were presented. Step 3 was standard and step 6 was deviant in one block, and this was reversed in the other. The standard-deviant probability ratio was 80-20%. Since steps 3 and 6 were presented the same number of times in each condition, any potential differences observed in the post-test should be attributed to the condition. The sequence of the blocks was counterbalanced. No fewer than three and no more than eight standard stimuli occurred between deviant stimuli. Each block started with 20 standards and contained 500 trials in total. The inter-stimulus interval was randomly varied between 600 and 700 ms. Following the presentation phase, participants heard as a control (lasting approximately 1 min) 100 instances of the deviant stimuli they had heard in the previous oddball presentation. This design allowed a response comparison of the same number of deviant stimuli in the oddball presentation (N = 100) and the control presentation (N = 100).
Participants were tested in a single session in a sound-attenuated booth at the MARCS Institute for Brain, Behaviour and Development at Western Sydney University. They watched a self-selected silent movie with subtitles during the experiment and were instructed to avoid excessive motor movement and to disregard the auditory stimuli. The stimuli were presented binaurally via Etymotic earphones with the intensity set at 70 dB SPL in Praat [50] and the volume level was set at a comfortable listening level consistent across participants as a result of piloting that showed MMN elicitation. The duration of the EEG experiment was approximately 45 min.

EEG Data Recording & Analysis
EEG data were recorded from a 64-channel active BioSemi system with Ag/AgCl electrodes placed according to the international 10/20 system fitted to the participant's head. Six external electrodes were used: four to record eye movements (above and below the right, on the left and right temple), and two for offline referencing (left and right mastoid). Data were recorded at a 512 Hz sampling rate and we made sure the electrode offset was kept below 50 mV.
Data pre-processing and analysis used EEGLAB [58] and ERPLAB [59]. First, data points were re-referenced to the average of the right and left mastoids. They were then bandpass-filtered with half power cut-offs at 0.1 and 30 Hz at 12 dB/octave. Time windows from 100 to 600 ms post stimulus onset were extracted ("epoched") from the EEG signal and baseline-corrected by subtracting the mean voltage in the 100 ms pre-stimulus interval from each sample in the window. Independent component analyses were conducted to identify and remove noisy EEG channels. Eye-movement components based on the activity power spectrum, scalp topography, and activity over trials were also removed. Noisy EEG channels that were removed were interpolated using spherical spline interpolation. Artefacts above 70 mV were rejected automatically for all channels. Participants with more than 40% of artefact-contaminated epochs were excluded from further analyses (n = 6). The epochs were averaged separately for standards (excluding the first 20 standards and the standards immediately following a deviant stimulus), for each deviant token, and for each control block.
Two difference waves were calculated by subtracting the mean ERP response to each control stimulus from the mean ERP response to its deviant counterpart. These difference waves were then grand-averaged across participants. In the grand-averaged waveform, we sought a negative peak 100 to 250 ms after consonant production (taking the 20 ms consonant portion of the stimulus into consideration) to ensure that we were measuring the neural response to the tone. This resulted in measuring the 120 to 270 ms time window post-stimulus onset. We then centred a 40 ms time window at the identified peak and measured the mean amplitude in that window per individual participant (cf. [47]). These mean individual amplitudes were our measure of MMN amplitude in further statistical analyses. Within the same 40 ms time window, latency was measured by establishing the most negative peak for each participant. These mean individual latencies then became the measure of MMN latency in the further analyses.

Results
Following previous studies (e.g., [47,60]), MMN amplitudes and latencies were measured at nine channels (Fz, FCz, Cz, F3, FC3, C3, F4, FC4, C4) and were analysed in two separate mixed analyses of variance (ANOVAs) with a between-subject factor of condition (unimodal vs. bimodal) and within-subject factors of phase (pre-vs. post-training), anteriority (Frontal (F) vs. frontocentral (FC) vs. central I), and laterality (left, middle, right). Based on activating different neural populations, the dependent variables including mean amplitude and peak latency may reflect different processing mechanisms [61]: the former may show the robustness of listeners' discrimination as well as the acoustic/phonetic difference between the standard and the deviant stimuli, and the latter may reflect the required time to process the difference between the stimuli. Both variables have been used as auditory perceptual processing measures at early pre-attentive levels for native and non-native speech [36,62]. MMN responses tend to occur at frontal (F) and frontocentral (FC) sites. If distributional training has an effect on tone perception, we predicted an increase in MMN amplitude at those sites between pre-and post-test, as evidence of the auditory change in stimuli initiating an involuntary attentional switch [39,63].
MMN mean amplitude. Figure 5 shows  As condition was our variable of interest, the MMN amplitude response at the Fz electrode site in each phase was compared against zero ( Figure 6) following previous literature [49,60,61,64,65]. Participants in the unimodal group exhibited significant MMN amplitudes at pre-test (t (15)   MMN peak latency. A mixed ANOVA with condition (2-level, unimodal vs. bimodal) as the between-subjects factor, and within-subject factors of test (pre-vs. post-test), anteriority (F vs. FC vs. C), and laterality (left vs. middle vs. right) was conducted on the mean MMN peak latency. Main effects of phase (F(1, 30) = 240.35, p < 0.001, ηg 2 = 0.729) and condition (F(1, 30) = 24.57, p < 0.001, ηg 2 = 0.179) were found, which were qualified by a significant phase × condition interaction (F(1, 30) = 19.68, p < 0.001, ηg 2 = 0.177). Post hoc Tukey tests revealed that there were no group differences in latency at pre-test (Mdiff = 0.05, p = 0.986) but at post-test, participants in the bimodal group had significantly earlier As condition was our variable of interest, the MMN amplitude response at the Fz electrode site in each phase was compared against zero ( Figure 6) following previous literature [49,60,61,64,65]. Participants in the unimodal group exhibited significant MMN amplitudes at pre-test (t(15) = −4.55, p < 0.001, d = −1.14) and at post-test (t (15)   As condition was our variable of interest, the MMN amplitude response at the F electrode site in each phase was compared against zero ( Figure 6) following previous lit erature [49,60,61,64,65]. Participants in the unimodal group exhibited significant MMN amplitudes at pre-test (t (15)

Discussion
Experiment 2 showed learning-induced neurophysiological changes in these listeners' tone perception at pre-and post-test. Reduction in tonal sensitivity was significant in the bimodal condition. In addition, the latency results showed an earlier peak in the post-than in the pre-test, with a larger impact again in the bimodal condition. Although behavioural results indicated limited discrimination prior to training, neural sensitivity was consistent with discrimination. Note that the pre-test sensitivity is not unexpected given listeners' psycho-acoustic sensitivity to non-native tones across ages [51], with sensitivity modulated by tone salience [66].
At post-test, although behavioural outcomes indicated improved discrimination, listeners' neural processing was generally weakened in both conditions. The results from the bimodal distribution may suggest that the increased exposure and familiarity with tones in the neural experiment hindered distributional learning. Alternatively, although the acoustic experience had given listeners more familiarity with tones, the information they received was insufficient to establish tonal categories from the frequency distribution.
Indeed, in the bimodal condition where MMN responses were diminished, frequency peaks in the distribution were near the two ends of the continuum (Figure 2, steps 2-7), whereas the peaks in the frequency distribution were at the midpoint (Figure 2, steps [4][5] in the unimodal condition. To process stimuli efficiently, listeners may focus on the most frequently presented (hence, most salient) stimuli in each condition. In the bimodal condition, steps 2 and 7 were highlighted when played alongside stimuli close to the discrimination boundary (steps 3 and 6), and listeners may expect a similar (or larger) difference to detect deviation compared with post-test. Contrasts with a smaller acoustic difference (i.e., steps 3-6) may then be harder to detect. On the other hand, those who were exposed to the unimodal condition, where steps 4 and 5 were the most frequent, would show neural responses to a contrast of larger acoustic difference (i.e., steps 3-6). In other words, the most frequent and prominent steps, namely the peaks of each distribution, would impact subsequent perception (To explore our proposed hypothesis on the impact of frequency peak, an additional behavioural test was conducted measuring Australian listeners' (N = 12) sensitivity to the designated tonal contrasts with manipulations of the acoustic distance between the tokens. While discrimination of steps 2-7, the acoustically distant contrast mostly presented in the bimodal condition, reached 82% accuracy, discrimination of steps 3-6 was at a chance (p = 0.511 against 0.5) and accuracy in the discrimination of steps 4-5, the acoustically close contrast presented most frequently in the unimodal condition, was 20%. These results reflect contrast salience). The overall pattern suggests that listeners' sensitivity is more auditory or psychophysical than phonetic or phonological at this stage: they can discriminate, but fail to establish categories.
With either an explanation in terms of acoustic salience, or an explanation in terms of perceptual assimilation, certain interactions between acoustic and statistical cues in listeners' neural processing will be assumed. Previous research has not only shown humans' ability to track input frequency distributions from the ambient environment, and their ability to abstract and retain the memory of non-native pitch directional cues, but also clear ability to shift their weighting of acoustic/phonetic cues and to reconfigure their learning strategies [19,20]. In a speech segmentation task when both statistical and prosodic cues are presented, listeners attend more to the latter to acquire speech information [21]. Indeed, when various types of cues (e.g., acoustic feature, frequency distribution) are presented to listeners, the weighting of these cues may be dynamic and change in real-time during experimental training [67,68]. Last but not least, the outcomes also confirm that EEG is more sensitive than behavioral measures in revealing listeners' responses in the course of speech perception [34][35][36][37].

General Discussion
This study tested the learning of a non-native tone contrast by non-tonal language speakers. Data were collected with both behavioural and neural test methods, statistical distributions were varied (along a single continuum) in the input, and both identification and discrimination data were collected and analysed. Non-native listeners' tone discrimination improved after both unimodal and bimodal exposure, and there was a reduction in sensitivity after training (observed with the more sensitive measures, to wit, MMN amplitude and latency, which were recorded using EEG). In the EEG data, perceptual differences emerged between the two exposure conditions. The bimodal condition appears to have a more negative impact than the unimodal condition after training. The reduced amplitude and early latency in the bimodal condition may reflect inability to establish categories from the frequency distribution, although listeners had become more familiar with tones after training.
Our neurophysiological finding contrasts with results in prior distributional learning studies (see Figure 2), where a bimodal distribution leads listeners to display enhanced distinction of steps midway along the distribution, while theeffect is reversed with the unimodal distribution, where steps within one peak become less distinct perceptually. We proposed that the lack of sensitivity at post-test after bimodal training is due to the exposure to a salient contrast during training that participants then expected to find again at test. In the absence of distributional learning, there was exploitation of salient acoustic cues instead. This interpretation is in line with the phonetic-magnitude hypothesis [44], which holds that the size of acoustic features (or phonetic distinctions, articulatory correlates) plays a more central role than the extent of native language experience, an interpretation consistent with the Second-Language Linguistic Perception model (L2LP; [69][70][71][72]), which highlights the interaction between listeners' native phonology and the magnitude of the phonetic difference in auditory dimensions. In a way, the adult listeners in the current study behaved like infants without prior knowledge of lexical tones and like adult L2 learners without prior knowledge of vowel duration contrasts. They concentrated on the most salient phonetic cues, while ignoring or lowering the weighting of other cues.
We further argue that compared to segmental features, pitch is a particularly salient feature. Pitch contrasts, regardless of height [41,42] or direction (as illustrated in the current study), may be more prone to be weighted acoustically compared to statistically, especially among non-tone language speakers who predominantly use pitch in pragmatic but not lexical functions. Note that this finding was only seen neurally and not behaviourally, which speaks to the fact that this might only be detected through the deployment of sensitive measures.
When native listeners process lexical tones, they attend to linguistic features such as pitch, intensity, and duration, based on their existing knowledge of the categorical structure and the phonotactics of their language [73]. When non-native listeners are presented with the same input, they have no relevant categorical or phonotactic knowledge to call upon. But their auditory abilities may be assumed to parallel those of the native listener, so that it is not surprising that simple discrimination elicits a similar pattern of performance for native and non-native listeners [25], even when any task involving recourse to frequency distributions leads to very different results from these two listener groups.
Results conform to a heuristic approach to processing: when hearing a subtle nonnative tone contrast, listeners' neural sensitivity may be insufficient to perceive the finegrained tone steps and turn more towards acoustic information, and especially the most frequently presented (i.e., most salient) stimuli within each type of distribution. The general pattern somewhat conforms to previous findings showing that a bimodal distribution does not necessarily promote distinct discrimination between the two peaks [13,38,39], and a unimodal distribution does not always hinder it [43]. Factors such as acoustic properties or perceptual salience between tokens may play a role.
Furthermore, we argue that listeners who were trained on a bimodal distribution in which the peak contrast (steps 2-7) contains a large, discriminable acoustic distance may exhibit hampered discrimination of contrasts that rest on smaller differences. In consequence, smaller differences along the continuum may be disregarded. On the other hand, training on a unimodal distribution in which the peak contrast (steps 4-5) is extremely difficult to discriminate may ease the processing of contrasts with a larger acoustic difference (i.e., steps 3 and 6). This explanation does not come out of the blue, as similar ideas have been proposed for studies showing that infants are sensitive to acoustically subtle phonetic contrasts [74] and develop their word learning abilities of very small differences in vowels in a speedy fashion [75]. This interpretation assumes that listener sensitivity to ambient acoustic and statistical information can pave the way for perception and learning of new contrasts in a second language. However, our results should not be construed as evidence that native English speakers cannot achieve a level of tone categorization matching that of native tone speakers [76]. Many factors may play a role. For example, past research has shown that although statistical information can prompt the formation of distinct phonemic categories within three minutes for some non-native contrasts, longer duration of exposure may be required to trigger learning in a bimodal condition [13]. In a previous study examining listeners' non-native tone discrimination ability, an overall effect of learning surfaced after around ten minutes of exposure [52]. In the present experiment, embedded statistical information altered perception within six minutes of exposure. The take-home message for this study is that for non-native contrasts presented without context, the acoustic salience of the sounds may initially matter more than the statistical distribution of the sounds. At least in the initial stages of learning, distributional evidence may play only a minor role. Future studies can investigate whether increased input might introduce robust effects of exposure to a bimodal distribution.